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demonstrate the efficiency of the method both on Hermitian and non-Hermitian matrices. 
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1. The overlap operator and the sign function 

In this contribution we present an improved method to compute the sign function of a large 
sparse matrix, as necessitated by the overlap Dirac operator in lattice QCD. More details and results 
can be found in Ref. [1]. The new method applies to Hermitian and non-Hermitian matrices and is 
therefore usable in QCD simulations at both zero and nonzero quark chemical potential. 

The overlap operator introduced by Neuberger and Narayanan [2, 3] and extended to nonzero 
quark chemical potential fi by Bloch and Wettig [4] is defined as 

D 0V (At) = l + 7 5 sgn(7 5 D w ( i u)), (1.1) 

with Wilson-Dirac operator D w (/i) and Wilson mass satisfying —2 < m w < 0. The quark chemical 
potential is introduced in the Wilson-Dirac operator as prescribed by Hasenfratz and Karsch [5]. 
The matrix YsD w is Hermitian when jJ. =0, but becomes non-Hermitian when /i / 0, such that the 
overlap operator requires the computation of the sign of a general complex matrix. 

For a generic function / and a diagonalizable matrix A = U diag(Ai , • • • , X n )U~ l G C" x ", with 
eigenvalues A,- and eigenvector matrix U = [ui \ . . . \u n ], the matrix function f(A) is given by the 
spectral definition 

f(A) = C/diagCftAO, . . . J{K)) U-K (1.2) 

In the non-Hermitian case the eigenvalues are generically complex and the sign function is denned 

as sgn(z) = sgn(Re(z)) [6]. 

2. Approximate Krylov subspace solution 

For physically relevant lattice sizes the spectral definition (1.2) cannot be applied directly as 
the diagonalization of A becomes too expensive in terms of CPU time and memory. A solution is to 
compute f{A)x, forx G C", using a Krylov subspace approximation, rather than f{A), as the results 
of such operations are typically needed by iterative solvers for linear systems and eigensystems. 

To approximate f{A)x in the Krylov subspace Jf?k(A,x) = span(;e,A;e, . . . ,A k ~ l x) we first con- 
struct a basis for the subspace. In the Hermitian case the Lanczos method generates an orthonormal 
basis Vk of J%(A,jc) using short recurrences. In the non-Hermitian case an orthonormal basis can 
be constructed using the Arnoldi algorithm. However, as the method uses long recurrences, it be- 
comes too expensive for large Krylov subspaces. Instead, we will use the two-sided or biorthogonal 
Lanczos method (2sL), where the orthogonality of the basis is given up in order to recover short 
recurrences. Two short recurrence relations are used to construct mutually orthonormal bases V k 
and W k , i.e., W^V k = l k , for the Krylov subspaces of A and A^, respectively. 

With these bases the Ritz approximation to f{A)x is computed using 

f(A)x « V k f(H k )V*x = \\x\\V k f(H k )ef\ (2.1) 

where we chose vi = x/||x||, ep is the first unit vector of C^, and the k x k Ritz matrix H k is 
defined as V^AV k and W^AV k , for the Hermitian and non-Hermitian case, respectively. The virtue 
of the Krylov subspace approximation is that a good accuracy can be reached for k <^n, such that 
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the computation of f(A) is replaced by that of f(Hk), which is of much smaller size. For the sign 
function, sgn(/4) is typically computed using the Roberts-Higham matrix iteration [7]: 

S n+l = l -{S n +S- 1 ) with S =A, (2.2) 

which converges quadratically to sgn(A). The combination of Eq.(2.1) with Eq. (2.2) to compute 
sgn(/4) will be called the non-nested method, in contrast to the nested method introduced in Sec. 3. 

Some additional care has to be taken when using the approximation (2.1) for the sign func- 
tion. If A has eigenvalues close to the discontinuity along the imaginary axis, the Krylov subspace 
would have to be taken very large to achieve a good accuracy with Eq. (2.1). This can be re- 
solved by treating these critical eigenvalues exactly using deflation [8]. Assume that we have 
computed m such critical eigenvalues A, with their corresponding right and left eigenvectors r, and 
/,, then the function evaluation is rewritten as f(A)x = R m f{h m )(lJ m x) + f(A)(I — R m Lj n )x, where 
A m = diag(Ai , . . . , A m ) , R m = [ri | . . . \r m ] and L m = [h \ . . . | l m ) . The first contribution to f(A)x is 
computed exactly using the eigensolutions for the deflated eigenvalues, while the second contribu- 
tion is approximated using the Krylov subspace approximation (2.1). 

The non-nested approximation has been implemented and tested on configurations ranging 
from 4 4 to 16 3 x 32 lattices [8- 10]. The performance of the method is illustrated for a typical 
8 4 configuration in Fig. 1 . As expected, the convergence is generically faster and smoother in the 
Hermitian case (Lanczos) than in the non-Hermitian case (2sL), see Fig. la. More relevant for the 
current study is Fig. lb which depicts the CPU times used in both cases. The figure highlights 
a serious issue with the non-nested method, as a bottleneck in the numerical computation can 
clearly be identified, in both the Hermitian and non-Hermitian case. The difference between the 
full and dashed lines corresponds to the time needed to compute sgn(/4) using the iteration (2.2). 
Evidently, this computation takes up a substantial amount of the total CPU time, as the €?(lc 3 ) 
complexity makes the Roberts-Higham iteration overly expensive for large Krylov subspaces. In 
this talk we will present a new method which alleviates this problem by computing sgn(i4)e[ of 
Eq. (2.1) using an additional Krylov subspace level. 




Figure 1: Accuracy versus (a) Krylov subspace size (left) and (b) CPU time (right) for a typical 8 4 config- 
uration in the Hermitian (/I = 0) and non-Hermitian (ji = 0.3) cases, with deflation gap A = 0.055. In the 
right panel the full lines show the total time required to compute sgn(A)x, while the dashed lines give the 
time needed to build the bases in the Krylov subspaces. 
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3. Nested Krylov subspace method for the sign function 

The idea is to approximate the vector sgn(^)ep of dimension k in a smaller, nested Krylov 
subspace (H k , e f ) of dimension I using the approximation (2.1), with A — > H k and x — > gj : 



sgn(A)jc« HjcllVtsgn^)^ » ||^||^sgn(// f )4 



(3.1) 



where sgn(//<?) is computed using Eq. (2.2). As it stands, Eq. (3.1) does not improve the efficiency 
of the original method because the inner Krylov subspace J^(fljt,ep) only contains information 
coming from the I x i upper left corner of H k , due to the tridiagonal nature of H k and the special 
source vector ej \ Therefore, the size of the inner Krylov subspace has to be chosen t « k in order 
to achieve the full accuracy of the outer Krylov subspace, and nothing has been gained. 

There is, however, a way to circumvent this problem and make the idea work. We observe that 



sgn(A) = sgn(A+A~ 



(3.2) 



since the eigenvectors of both arguments are the same, and the transformation does not change the 
sign of the eigenvalues, i.e., 
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sgnRe(z) = sgnz, (3.3) 



for z € C. Based on the property (3.2) we introduce a preconditioning step H k — > + H. in 
Eq. (2.1), which enables us to use the nested approximation (3.1) with inner Krylov subspace 



jef{H k +H k l ,ef ] ) instead of jei{H k ,e\ K '). As we will see in Sec. 4, the nested approximation 
with preconditioning step works well with I <C k, i.e., there is no noticeable loss in accuracy even 
after reducing the Krylov subspace size substantially. This increased efficiency can be understood 
by noting that the transformation H k — ► Hu+H^ 1 improves the condition number by a factor ten, 
approximately. This is illustrated in Fig. 2 showing the spectra of H k and H k +H k ~ l for the Hermi- 
tian case. Clearly, the preconditioning step considerably widens the gap around the origin, causing 
the improvement in condition number. 
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Figure 2: Eigenvalue distributions of (top) and H k +H k 1 (bottom) for a 6 4 lattice with k = 1024. 
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Figure 3: Convergence of the nested method for an 8 4 lattice in the Hermitian case (deflation gap A = 0.055). 
The blue line shows the accuracy £ of the non-nested approximation versus the outer Krylov subspace size 
k. The vertical line picks a value for k, here k = 1536, corresponding to a desired accuracy. Finally, the red 
line shows the accuracy of the nested method as a function of the inner Krylov subspace size I for fixed k. 




4. Numerical results 



In this section we show some preliminary numerical results for the nested Krylov subspace 
method. The gain in efficiency of the method is characterized by the smallest value of £/k for which 
no relevant loss of accuracy occurs. This is illustrated in Fig. 3, which shows the convergence of 
the nested method. The crucial feature is that, for fixed k, the accuracy of the nested approximation 
remains optimal over a very large range in £ until the error eventually blows up when the inner 
Krylov subspace becomes too small. From the figure one observes that I can be chosen about ten 
times smaller than k without affecting the accuracy of the approximation. Fig. 4 illustrates how this 
reduction translates in a gain in CPU time. For fixed k, the inner Krylov subspace size £ is varied 
and the corresponding accuracy and CPU time can be read from the figure. The large gain in CPU 
time achieved when reducing I is due to the 0(£ 3 ) cost to compute sgn(///>). It is remarkable that 
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Figure 4: Gain in CPU time of the nested method for an 8 4 lattice in the Hermitian case (left) with k = 1536 
(A = 0.055) and the non-Hermitian case (right) for \i = 0.3 with k — 1600 (A = 0.11). The red line shows 
the accuracy e(£), for a fixed outer Krylov subspace size k. The total CPU time used by the method is 
given by the full blue line, while the horizontal dotted line represents the time needed to construct the outer 
Krylov subspace. The difference between both corresponds to the time needed to compute sgn(H() using 
the Roberts-Higham iteration and is of order The vertical band shows the range of optimal L 
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there is a region of i, given by the vertical band, where the accuracy is still maximal but where the 
computation time of sga(Hk)ef is negligible. This makes the nested method extremely efficient, 
at least for the lattice sizes considered in these preliminary tests. 

To explore the employability of the method for realistic lattice calculations it is useful to 
investigate how the algorithm scales with the lattice volume. This is illustrated in Fig. 5. Fitting the 
CPU time gives an approximate volume dependence t ~ V 1 2 for the Hermitian case and/ ~y 14 for 
the non-Hermitian case. In this context we also compare the efficiency of our new method with the 
rational approximation methods, which are the best methods currently on the market, see Fig. 6. For 
the Hermitian case the nested method performs slightly better than the Zolotarev method. However, 
the difference is not significant and conclusions may depend on the details of the implementation 
(especially since the nested method is sometimes used in double pass mode, depending on the 
available hardware, to avoid storage problems for large lattices [1]). For the non-Hermitian case, 
where we compare with the rational approximation method presented in Ref. [10], the conclusion is 
unambiguous and the nested method is by far better than the rational approximation. This is caused 
by the lesser efficiency of the rational approximation in the presence of complex eigenvalues [10]. 




Figure 5: Scaling with matrix size: CPU time versus lattice volume for the Hermitian case (left) and non- 
Hermitian case with /i = 0.3 (right) for three different deflation gaps: 0.025, 0.5 and 0.1. The requested 
accuracy is e = 10~ 8 . The dashed lines represent double pass calculations [1]. 




Figure 6: Comparison of the CPU time used by the nested method and rational approximation methods: For 
the Hermitian case (left) we compare the nested method with the Zolotarev approximation [1 1], for the non- 
Hermitian case (right) we compare with the Neuberger approximation computed with a multishift, restarted 
FOM method [10]. The requested accuracy is e = 10~ 8 and the deflation gap is A = 0.05. 
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5. Summary and outlook 

The Ritz approximation to the sign function slows down dramatically when the Krylov sub- 
space grows large. We therefore developed an improvement based on nested Krylov subspaces, 
which resolves this problem and expedites the computation of the sign function for Hermitian and 
non-Hermitian matrices, without affecting the accuracy of the approximation. Moreover, the new 
method turns out to be a worthy alternative to state-of-the-art rational approximation methods. 
More details about the nested method can be found in Ref. [1]. 

Future developments will include the parallel implementation and benchmarking of the nested 
method, its incorporation in hybrid Monte Carlo algorithms for dynamical simulations with overlap 
fermions, and the investigation of the applicability of the nested method to other matrix functions. 
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